HDDS-2107. Datanodes should retry forever to connect to SCM in an… #1424
hanishakoneru merged 1 commit into apache:trunk
…ecure environment
|
/label ozone |
|
@xiaoyuyao @hanishakoneru @anuengineer @elek @bharatviswa504 Please review |
|
💔 -1 overall
This message was automatically generated. |
adoroszlai
left a comment
Hi @vivekratnavel,
As far as I see, DataNode already tries forever due to the main loop in the state machine:
You can verify this by starting DataNode without SCM, and setting the IP for scm to the DataNode's own address:
cd hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/compose/ozone
docker-compose up -d datanode
docker-compose exec datanode bash -c "tail -1 /etc/hosts | sed 's/\t\+[a-z0-9]*$/ scm/' | sudo tee -a /etc/hosts"
docker-compose logs -f --tail=10 datanode
Result:
...
datanode_1 | 2019-09-11 12:29:39 INFO Client:948 - Retrying connect to server: scm/192.168.0.2:9861. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
datanode_1 | 2019-09-11 12:29:39 ERROR EndpointStateMachine:204 - Unable to communicate to SCM server at scm:9861 for past 300 seconds.
...
datanode_1 | 2019-09-11 12:29:40 INFO Client:948 - Retrying connect to server: scm/192.168.0.2:9861. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
...
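The pattern visible in the log above (the retry counter reaching 9 and then starting again at 0) can be sketched as a bounded retry wrapped in an outer loop. This is a minimal illustration only, not Ozone code: the class `RetryLoopSketch`, its methods, and the simulated endpoint are all hypothetical names introduced here.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch, not Ozone code: bounded inner retries inside an outer loop. */
public class RetryLoopSketch {

  /** Tries the call up to maxRetries times; returns true on the first success. */
  static boolean connectWithBoundedRetries(Callable<Boolean> attempt, int maxRetries,
      long sleepMillis) throws InterruptedException {
    for (int tried = 0; tried < maxRetries; tried++) {
      try {
        if (attempt.call()) {
          return true;
        }
      } catch (Exception e) {
        // Treat any failure as "not connected yet" and keep going.
      }
      Thread.sleep(sleepMillis);
    }
    return false;
  }

  /** Outer loop: restarts the bounded retry until it succeeds; returns rounds used. */
  static int runUntilConnected(Callable<Boolean> attempt, int maxRetries,
      long sleepMillis) throws InterruptedException {
    int rounds = 0;
    while (true) {
      rounds++;
      if (connectWithBoundedRetries(attempt, maxRetries, sleepMillis)) {
        return rounds;
      }
      // The inner counter starts again from 0 on the next round.
    }
  }

  public static void main(String[] args) throws InterruptedException {
    int[] calls = {0};
    // Simulated endpoint (an assumption) that becomes reachable on the 25th attempt.
    int rounds = runUntilConnected(() -> ++calls[0] >= 25, 10, 1L);
    System.out.println("attempts=" + calls[0] + ", rounds=" + rounds);
  }
}
```

With 10 tries per round and success on the 25th attempt, the outer loop needs three rounds, which mirrors how the counter in the log wraps around while the DataNode keeps trying.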
|
@adoroszlai You are right. With this change, we don't get the error from |
|
Thank you @vivekratnavel for working on this. |
|
@hanishakoneru Sure. |
|
Thank you @vivekratnavel. +1. I will commit it. |
…ecure environment (apache#1424)
… unsecure environment
In an unsecure environment, datanodes retry up to 10 times, waiting 1000 milliseconds between attempts, before throwing this error:
This PR fixes that issue by having datanodes retry forever to connect to SCM instead of throwing an error from the state machine.
I have also increased timeouts on a unit test to improve its stability.
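To make the difference between the two behaviours concrete, here is a small sketch that contrasts a bounded retry policy with an unbounded one, both using a fixed sleep. It does not use Hadoop's retry classes and is not the code changed in this PR; `RetryPolicySketch`, `upToMaximumCount`, `forever`, and `attemptsUntilSuccess` are hypothetical names chosen for illustration.

```java
import java.util.function.IntPredicate;

/** Hypothetical sketch: bounded versus unbounded retry with a fixed sleep. */
public class RetryPolicySketch {

  /** Bounded policy: allow a retry only while fewer than maxRetries were made. */
  static IntPredicate upToMaximumCount(int maxRetries) {
    return retriesSoFar -> retriesSoFar < maxRetries;
  }

  /** Unbounded policy: always allow another retry. */
  static IntPredicate forever() {
    return retriesSoFar -> true;
  }

  /**
   * Simulates a server that becomes reachable on attempt number succeedOnAttempt.
   * Returns the number of attempts made, or -1 if the policy gave up first.
   */
  static int attemptsUntilSuccess(IntPredicate policy, int succeedOnAttempt,
      long sleepMillis) throws InterruptedException {
    int retries = 0;
    while (true) {
      int attempt = retries + 1;
      if (attempt >= succeedOnAttempt) {
        return attempt;
      }
      if (!policy.test(retries)) {
        return -1; // the bounded policy surfaces a failure here
      }
      retries++;
      Thread.sleep(sleepMillis); // fixed sleep between attempts
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Server comes up on the 25th attempt: bounded gives up, forever succeeds.
    System.out.println(attemptsUntilSuccess(upToMaximumCount(10), 25, 1L));
    System.out.println(attemptsUntilSuccess(forever(), 25, 1L));
  }
}
```

Under these assumptions the bounded policy gives up after its 10 retries are exhausted, while the unbounded policy simply waits until the simulated server is reachable.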